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SOLUTION OF A TRIDIAGONAL SYSTEM OF EQUATIONS 
ON THE FINITE ELEMENT MACHINE 


INTRODUCTION 

The solution of tridiagonal systems of equations on parallel computers has been 
the topic of many studies over the last ten years. Both direct and iterative solu- 
tions have been analyzed. In many cases, the analysis has been made on simulators or 
"theoretical" machines, in particular the paracomputer. The paracomputer is a theo- 
retical machine which has an unlimited number of processors and no overhead for com- 
munication between processors. There has been limited experience in analyzing these 
solutions on actual parallel computing systems. An experimental hardware/software 
array processor system in operation at NASA's Langley Research Center, the Finite 
Element Machine, was used to study the issues involved in implementing these solution 
techniques . 

The two algorithms which were run on the Finite Element Machine are the Acceler- 
ated Parallel Gauss method (ref. 1,2) an iterative method, and the Buneman algorithm 
(ref. 2), a direct method. Some of the issues involved when applying solution tech- 
niques in a parallel processing environment include: programming strategy, balancing 

the workload among the processors, and the amount of communication between the pro- 
cessors. This paper discusses these issues, analyzes the amount and type of overhead 
associated with the execution of the programs, and presents timing results obtained. 


FINITE ELEMENT MACHINE 
Hardware 

The Finite Element Machine, being developed at NASA Langley Research Center, is 
an experimental MIMD (multiple instruction, multiple data) system. The architecture 
is that of an array of microprocessors with each processor capable of being connected 
to 12 other processors by a local link and each processor also capable of communicat- 
ing to any other processor via a global bus. The FEM is connected to the outside 
world by a controller, on which programs are coded, compiled, debugged, and then 
downloaded to the array. Each processor has a copy of its own program, its own 
memory, and each can run asynchronously. A block diagram of the FEM is depicted in 
figure 1 . At present eight microprocessors are installed in the array. Each micro- 
processor consists of three boards, one for processing (CPU) and two for communica- 
tions (10-1 and 10-2). The CPU board is a 16 bit microprocessor with 16K of ROM 
(read only memory), 32K of RAM (randomly accessed memory), and an AM9512 floating 
point chip ( ref . 3 ) . 

The Finite Element Machine was initially designed for solution to structural 
analysis problems by the finite element and finite difference approximation methods. 
These types of problems have many calculations that can be solved simultaneously so 
they are ideally suited to parallel computing. Many methods used to solve such prob- 
lems are time-consuming iterative techniques. The FEM architecture is being studied 
as to its capability of supporting the efficient solution of other types of problems 
as well. 



Software 


Tho development of new oystom aoftwaro io Q3scntial to tho effective uaa of the 
special capabilities of any parallel computer. Tho Finite Element Machine has custom 
designed software to provide control and run-time support for programs written for 
the FEM. An operating system, NODAL EXEC, resides on each processor. One function 
performed by this system is tho definition of special areas called data areas, where 
data can be stored and saved during a session. Also residing on the processors are 
PASL1B routines, which are special function procedures which enable the processors to 
communicate, access data in data areas, and supply support for application programs 
(ref. 4). FEM ARRAY CONTROL SOFTWARE (FACS) supports program development, program 
execution, and analysis. FEM commands are implemented as control language procedures 
and are integrated into the host operating system. Commands are issued interactively 
at the terminal or submitted as batch stream commands (ref. 5). 

By the use of FACS commands, tho proper environment for executing a program on 
FEM is established. The hardware and software is initialized and the array configu- 
ration is selected. Data areas are defined and loaded and the program is downloaded. 

Programs may be run in one of two modes: synchronous or asynchronous. Both 

programs discussed here were run in synchronous mode. In this mode, input to each 
processor is queued in order until it is received. In running in asynchronous mode, 
only the most recent data are used. 


ALGORITHMS 

Matching the parallelism of an algorithm to the parallelism of the computer is 
an essential ingredient in obtaining cost-efficient results. The number of arithme- 
tic operations which can be performed simultaneously determines the parallelism of an 
algorithm. On a pipeline computer the parallelism is equal to the vector length. On 
an array processor the parallelism is measured by the number of operations done in 
parallel. This is not always constant throughout a solution. 

The algorithms developed in the past to solve large scientific problems have 
been mainly designed to run on conventional sequential computers. The number of 
arithmetic operations involved determined the speed in which problems were solved. 

The idea of solving problems in parallel introduces new concepts in speed and effi- 
ciency. Particularly, the problems of synchronization and communication must be 
addressed. New algorithms must be developed and old ones re-examined. A brief dis- 
cussion of tridiagonal systems and the algorithms investigated in the study is 
summarized below. 


Tridiagonal Systems 

Many problems being solved in the scientific community involve the solution of 
tridiagonal systems of equations such as the system solved in this study of the form 
(see fig. 2), 


AX = Y 
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whore X and Y are the column vootora of {x^} and {y^, 1 ■ 1,2,...,n with the 

{y } given. The matrix A has diagonal elemants {dj_}, i ■ 1,2,...,n (in this case 
{d*} is taken to be unity), subdiagonal elements {A^}, i ■ 2,3,...,n, and super- 
diagonal oloments {u^}, i ■ 1,2,...,n-1. These systems often result from finite- 
difference approximations to second order differential equations as in Holmholtz, 
Laplace, Poisson, and diffusion equations (ref. 1). Both direct and iterative 
methoda are used to solvo ouch systems* While iterative methods soem to parallelize 
boat, direct methods aro ofton the best choice for efficiency in the overall problem 
solution* Thus investigations are needed of both methods* Although many techniques 
have boon designed to solvo a tridiagonal system of equations, most of these have 
been developed to run on sequential computers and are not able to be applied effi- 
ciently on a parallel computer* In many methods the values of the latest terms 
depend on previously computed terms which make them inherently sequential* There- 
fore, new approaches aro being studied to introduce parallelism into tridiagonal 
solution methods. In practice, many such systems may be solved where A would stay 
tho same and only the Y would vary. 


Accelerated Parallel Gauss 

Tho Accelerated Parallol Gauss Method was developed by Heller, Stevenson, and 
Traub (ref. 1). Tho method is an improved version of Traub's Parallel Gauss algo- 
rithm and is outlined in tho appendix. 

The classical Gaussian elimination algorithm for tridiagonal system may be 
stated as follows. 

1. (Factor A - (I + L)D(I + R).) 

Let d^ * 1 . For i ® 2, 3, . . * , n, let d^ — 1 — ^^i — 1 * 

2. (Solve (I f L)f - c.) 

Sot l, «• a./d. - for all i > 1. Let f 1 «* c 1 . For i ■ 2,3,...,n, 

let fj - c I - 

3. (Solvo (I + R)x = D (_1) f.) 

Set r “ u^/d^ a ^d g^ = f i/ d i for all 1* Let (* n " ?n^’ 

For i ■ n - 1 ,n - 2, . . . , 1 , let x j. “ “ r^xj^ . 

This classical algorithm is sequential because each calculation depends on previously 
calculated results. This method was converted into an iteration method by succes- 
sively approximating the d, then the f, and finally the x. For example, 


d| i) » 1 - i i b i _ 1 /d^“ 1) for all i > 1, 


where the superscripts denote the iteration step (appendix). 

By using successive approximation we increase the amount of computation. 
Although the number of arithmetic operations increases, the fact that many computa- 
tions take place in parallel can make the algorithm more timo efficient on some 
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parallel computers for a given number of equations. As implemented on FEM, all pro- 
cessors are busy all the time and all iterations are carried out in parallel. The 
APG method is efficient for certain types of problems; however, under certain condi- 
tions this algorithm is susceptible to round-off errors and the rate of convergence 
is sometimes slow. 

The APG method consists of three steps, solving for d, solving for f, and then 
using bach substitution to solve for x. The number of iterations for each step is 
determined by the convergence criteria (ref. 2). 


Buneman Algorithm 

The Buneman algorithm, developed by 0. Buneman (ref. 2), is a variant of cyclic 
reduction and is outlined in the appendix. A direct solution is found by combining 
adjacent terms to obtain a relation between every other term, every fourth term, 
every eighth term, etc. For a system of N equations, log 2(N) - 1 levels of 
reduction are carried out until the relation is between known values (the boundary 
conditions). All unknowns are then found by back substitution. The algorithm is 
numerically stable and the answer is correct to within the round-off error. The 
answer is obtained in a predetermined number of steps. 

The application run on the FEM assigned several equations to each processor. 

The ratio of the amount of idle time and the communication time to the amount of 
computation time decreased as the size of the problem was increased. 


PROGRAM DEVELOPMENT 
Problem Mapping 

One of the basic issues facing a programmer of a parallel computer is the map- 
ping of the problem to the particular architecture of the computer. The workload 
should be equally distributed among the processors and the amount of communication 
should be minimized. In the APG case each processor only has to communicate with at 
most two of its nearest neighbors. In the Buneman case, as the number of processors 
increases, so does the number of processors one must communicate with increase. The 
test cases were run on up to eight processors and each processor was connected to 
every other processor by both a local link and the global bus. 

The overhead of communication on the Finite Element Machine is high compared to 
the computation speed. This is a function of the software which provides input buf- 
fers and allows up to 255 words and several types of data to be exchanged. 


Mapping of the Accelerated Parallel Gauss Method 

The primary consideration in the program design was to balance the communication 
with the computation. For this reason, the equations were ordered in a way that 
minimizes the time a processor waits for a transfer. 

For the APG method (see appendix), the initial values for the solutions of the 
odd equations of d and f are known; hence, these equations are solved first. 

They are solved in descending order so that the last value is the one that must be 
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passed to its neighbor. The equations were arranged in such a way as to compute the 
last odd first and the first even last. 

At every iteration, each processor has at most one receive and one send to its 
nearest neighbors. By increasing the number of equations to be solved in each pro- 
cessor, the computation time increases while the transfer time stays the same. For 
the algorithm to be most efficient, enough computation must be assigned to each pro- 
cessor so that time is not spent waiting for a transfer. Figure 3 shows a mapping of 
20 equations on four processors. 

For the iterations on d and f, the first processor only sends and the last 
processor only receives. This is reversed for the x iterations. This test case is 
extensible to a larger set of processors solving a larger system of equations. The 
order of solving equations in a system and the maximum number of equations per pro- 
cessor were issues addressed in the implementation of this method. 


Mapping for Buneman Algorithm 

The Buneman algorithm is a parallel direct method and in contrast to iterative 
methods, the solution is found in a predetermined number of steps. For a given num- 
ber of equations, the same total number of calculation steps is independent of the 
number of processors. The number of transfers, however, varies with the number of 
processors. For an efficient approach, several equations should be assigned to each 
processor so that the amount of computation in each processor outweighs the communi- 
cation tasks. 

During the reduction stage, each processor sends results to its nearest neighbor 
(with the first only sending to the right and the last only sending to the left) 
until there is only one equation to solve in each processor. From this level, the 
number of working processors is divided in half until the reduction is down to one 
equation. During the back substitution stage, the communication is reversed. Each 
processor only has to receive one value. From that stage on, the processor has all 
the information it needs to solve the equations assigned to it. 

The application program written for FEM assigned a middle processor to compute 
the last equation. This minimized the number and distance of the transfers. 

Figure 4 shows the mapping of 16 equations on four processors. 


RESULTS 

Figure 5 gives actual timing results for the two algorithms implemented on FEM 
for Nth order test problems. Tables 1 and 2 give timing results for the two algo- 
rithms where parallel efficiency and speedup are defined as: 


. . , Time to run o n one processor 

Parallel e c ency - NpE * time to run on NPE processors 


_ Time to run on one processor 
peedup - T ^ me to run on N p E processors 

NPE = number of processors 
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Execution Statistics for APG 


Where: 

K » number of iterations for d 
L = number of iterations for f 
M a number of iterations for . x 
N = number of equations in system 
K=50, L=100, M=100, N=1 00 


Number of processors 

Milliseconds 

Efficiency 

Speedup 

1 

33215 



2 

17436 

0.9525 

1.9 

4 

9758 

.851 

3.4 

8 

6129 

.6774 

5.4 


Table 1 


Execution Statistics for Buneman 


Where: 


N=1 28 


Number of processors 

Milliseconds 

Efficiency 

Speedup 

1 

3982 (est) 



2 

2238 

0.89 

1.78 

4 

1211 

.82 

3.29 

8 

672 

.74 

5.92 


Table 2 


ESTIMATED TIMING RESULTS 

An estimated speedup was computed by analytically modeling the APG and Buneman 
algorithm programs as run on PEM. These formulas, summarized in the following sec- 
tion, are not expected to give exact results as each formula only takes into consid- 
eration the number of arithmetic operations and the number of transfers. Embedded in 
the program is also a decision-making code which includes tests to determine whether 
a value is to be sent or received. Table 3 compares the expected speedup of APG when 
run on 2, 4, and 8 processors as compared to the estimated and actual time to run on 
one processor. Table 4 gives similar results for the Buneman algorithm. Figure 6(a) 
shows plots of actual versus estimated times it would take to compute the equations 
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on a sequential computer with the same speed as the PEM. Figure 6(b) shows the 
actual times compared to estimated times with the communication time for each proces- 
sor included* 


Accelerated Parallel Gauss 

The total time for an APG solution is estimated to be 
N[2*T a (K + L + M) + 8*T a ] + 2T t (K + L + M) 


where 


T a = 500 
T fc = 1870 
N = 100 
K = 50 
L = 100 
M = 100 


Time for an arithmetic operation in microseconds 

Time for a transfer (either a send or a receive) in microseconds 

Number of equations in system 

Number iterations to solve for d 

Number iterations to solve for f 

Number iterations to solve for x 


(The 8 arithmetic operations include normalizing £ and u, initializing y and x, 
and computing £*u, l, g, and r. The two transfers are for a send and a receive 
for each processor, even though some only send or receive.) 

Estimated Solution Time in milliseconds for the baseline APG problem (N=100) is 
On one processor: 25400 

Two processors: 13167.5 

Four processors: 7285 

Eight processors: 4531.5 



Actual speedup vs estimated speedup 

Two processors 

Four processors 

Eight processors 

Estimated speedup 
Actual speedup 

. 25400 _ 

13167.5 1,929 

33215 _ 

17436 " 1,905 

25400 „ 

7285 ’ 3 ' 486 

33215 , ... 

9758 = 3,404 

^ 4 ?° - 5.618 

4531 .5 

33215 P 

6129 ‘ 5 ' 4 ' 9 


Table 3 
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Buneman Algorithm 


The total time for the Buneman Solution is estimated from 

N e * (12621) + N g10 * T s10 + N r * T r10 + N g * T bs + N g1 * T g1 + N r1 * T g1 

Where: N e = Number of equations to solve in each processor 

(To solve a system of 128 equations, 255 equations are solved) 

N g10 » Number of sends of ten words each 

N r10 = Number of receives of ten words each 

N gl = Number of sends of one word each 

N ri = Number of receives of one word each 

T g10 = 3910 Time for a send of ten words 

T r10 " 9970 Time for a receive of ten words 

T gl = 1870 Time for a send of one word 

T r1 = 6160 Time for a receive of one word 

Estimated solution time in microseconds for the baseline Buneman problem (N=128) 

One processor: 3981 .8 

Two processors: 2062.5 

Four processors: 1132.4 

Eight processors: 623.5 



Table 4 
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CONCLUSIONS 


Two algorithms for the solution of tridiagonal systems were implemented on the 
Finite Element Machine. Many variables influence the speedup that can be achieved 
for a given algorithm on the Finite Element Machine. Some of the issues addressed 
were: the size of the problem* the number of processors for a given case* the ratio 

of the computation time to the communication time, and the balancing of the workload 
among the processors. The amount and type of overhead was analyzed. 

Execution statistics obtained by running these test cases on FEM were presented. 
By allowing computations to take place in parallel* the time to solve a large system 
of equations can be greatly reduced. Although the rate of speedup tends to decrease 
as more processors are added, the cost-effectiveness of the addition of processors 
needs to be taken into consideration. In some cases, additional processors may be 
necessary to achieve timely or cost effective results. 

For problems where many tridiagonal systems need to be solved, the most effi- 
cient method may be for each processor to solve its own system with little or no 
communication with its neighbors. In a parallel processing environment there is 
still the problem of communicating with the controller to download data and upload 
results. 

The problems summarized in this study should serve as benchmarks in evaluating 
other algorithms as well as making comparisons of timing, changes in the system, 
new software, different modes of operation, and various ratios of computation to 
communication. 
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APPENDIX 

ACCELERATED PARALLEL GAUSS ALGORITHM 
BUNEMAN ALGORITHM 


ACCELERATED PARALLEL GAUSS* ALGORITHM 


1. Let » 1 for all k. Then for k ■ 1, 2, ... K 

o 


,(k) 


1 - % 


A 1 •« 

(k) 


d* k) = 1 - i even 


2. Define 4 


i - 


for i > 0. Let 


f? = for all i 


f (k) = y for all k 
o o 


Then for k = 1, 2, ••• L 


f i° = y i “ £ i f i-i 1); 1 066 


f{ k) ■ 1 even 


3. Define 

qi = f[^ ) /d^ ) for all i 

for all i < N 

4. Set 

x{ o) = f[ £) /d{ k) for all i 

and 

= gN for all k 


*D. E. Heller, D. K. Stephenson, and J. P. Traub, Carnegie Mellon University, 
Pittsburgh, Pennsylvania. 



Then for k = 1, 2, ... 


M 


.00 


(k-1) 

“1 

a q i - r i 

V 

1+1 

.oo 


OO 

1 

1! 

1 

H* 

x i+1 ; 


i even 


odd 



are then an approximation to the true solution x. 
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BUN EM AN ALGORITHM* 


1 . Define for i ■ 0 , 1 , ... N 
A (o) - 1 P (0) 

*i h p i 


d (°) H d 
d i “ d i 


f .<°> - 


u l o) “ u i 


Then for K •» 0 , 1 , ... n -1 


„(k+1) 0 (k) c (k) .<k) 

*. * v x i a k 

1 i-2 K 1 i+2 


( k + i, . no d oo „»> + ,0.) m jio 

1 1 i+2 K i-2 k i+2* i-2 K 


. (k+1 ) 


- d (k > 

d (k) 

d <k> 


i-2 k 

i 

i+2 k 


.(k) , 

(k) „ 

(k) 


“ d v u 
i-2 k 

i 

i+2 k 


(k) 

( (k) 

. (k) (k) 

(k) 

= Pi + 


" *i P v ' 
1 i-2 

■ u. p 

i+2 


•)/■ 


.(k) 
ki/ d i 


k+1 , k.(k) (k) .(k) (k) (k) 

[. d k q k +d .u q . 

1 1 i+2 K i-2 K i-2* 1 i+2* 




(io d oo u oo + t oo d 

1 + 2 * 1 - 2 * 1 + 2 * 




(k+1) 


. . . . . * _k+1 , - , -n-k+1 

with i ■ 3 * 2 , j = 0, 1,... 2 


* 0 . Buneman, Stanford University Institute for Plasma Research. 



2 * 


d < n +1> 


. (n) 

%n 

2 


(n) 

2 n+l 



. (n) 
* 2 n+1 







P 


(n+1 ) 




f (n) 

2 " 





_(n+1 ) 
* „ 


f (n) 

2 “ 










+ A 




u 


(n)\ (n+1 ) 

2" JV 


This completes the reduction stage. 


3. In the back-substitution stage: 


X - P <n+1) + q (n+1) /d (n+1) 


n+1 ) l^ini 
n * _n %n / 2 n 


P 

2 “ 2 


Then, 


*= - n) + ^ (n> - “ (n) * Va (n) 

o o Po o 2 n '/ ° 




V a J«> x (J n) >) \ / _ (n > 

2 n +1 2 n+1 ( 2 n+1 " 2 n+1 V )/ 2 n+1 


Finally, for k » n-1 , n-2, ... 1, 0, 


t - D (k) + (a M - 

i P 1 \J*t *1 Jt i _ 2 k u i 


00 


with i ■ 2*, 3 • 2 k . . . 
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FINITE ELEMENT MACHINE BLOCK DIAGRAM 



Figure l. Finite Element Machine block diagram. 














TRIDIAGONAL SYSTEM 



DIAGONAL ELEMENTS = 1 

li = LOWER DIAGONAL ELEMENTS 

ui = UPPER DIAGONAL ELEMENTS 




Figure 2. Tridiagonal system of equations. 
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Figure 4. Mapping of 16 equations on four processors (Buneman). 
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Figure 5. Timing results for the two algorithms implemented on FEM. 
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